Abstract. Scene analysis, the process of converting sensory
information from peripheral receptors into a representation
of objects in the external world, is central to our
human experience of perception. Through our efforts to 
design systems for object recognition and for robot naviga- 
tion, we have come to appreciate that a number of common 
themes apply across the sensory modalities of vision, audi- 
tion, and olfaction; and many apply across species ranging 
from invertebrates to mammals. These themes include the 
need for adaptation in the periphery and trade-offs between
selectivity for frequency or molecular structure and resolution
in time or space. In addition, neural mechanisms
involving coincidence detection are found in many different 
subsystems that appear to implement cross-correlation or 
autocorrelation computations. 


Introduction 


As we walk in a busy city or even a pristine forest, our 
senses are bombarded by signals from many sources. The 
acoustic signals entering our ears are a mixture of sounds
produced by many sources as well as innumerable echoes. 
The photons reaching our retina have been reflected off a 
complicated montage of clothing, faces, automobiles, and 
buildings or perhaps off a mixture of leaves, stems, insects,
birds, soil, and flowers. Likewise, the molecules reaching
our olfactory epithelium may be a mixture of burnt hydrocarbons,
perfume, and the smell of decaying trash or a
combination of fragrances from flowers, musk from ani- 
mals, and byproducts of the breakdown of leaves. We refer 
to the problem of interpreting this jumble of sensory input 
and relating it to the physical world as scene analysis. 

Many of the current ideas about scene analysis in general 
started with experimental and theoretical work on vertebrate 
vision. David Marr (1982) introduced a conceptual frame- 
work that spanned the entire range of issues from perception 
down through the physiological mechanisms to the actual 
underlying computations. The core idea is that sensory 
systems carry out specific computations that can be described
mathematically, and that if these computations are
understood, then they can be implemented as computer 
programs or in electronic hardware. 

Our own approach to designing artificial systems for 
scene analysis follows Marr’s lead. We start with physio- 
logically based models that replicate the responses of the 
sensory receptors and neural structures that appear to be 
involved with the early stages of sensory processing. These 
models are then further abstracted to a form in which they 
can be used as the starting point for the design of very 
large-scale integrated circuits (VLSI). The VLSI circuits, 
after fabrication, are then integrated with appropriate sen- 
sors, and the outputs are fed to a microprocessor for tasks 
such as grouping, object localization, and object classifica- 
tion. 


Visual Scene Analysis 


Visual scene analysis in mammals is believed to take
place through a series of parallel pathways (Fig. 1). The
image projected by the lens onto the retina is transduced by
photoreceptors, and then contrast is enhanced by neural
processing before the visual information is split into specialized
pathways that appear to extract important features
such as distance, orientation, velocity, color, and size (Marr,
1982). Individual regions of the visual image are analyzed
for these different features, and then selected portions are
grouped together through selective attention to form visual
objects that can be identified.

Figure 1. Visual feature analysis consists of first transducing the light
projected onto the photoreceptor array and enhancing the contrast of the
projected image. This is followed by parallel pathways of feature extraction,
the outputs of which are then processed to group related elements to
form visual objects.

Similar processes may also be taking place in inverte- 
brates. For example, cells from the third optic ganglion of 
dragonflies respond selectively to different target classes 
with properties that are remarkably similar to those of cells 
from the mammalian visual cortex (O’Carroll, 1993). Also,
bees—like mammals—can recognize a familiar shape un- 
der a variety of viewing conditions regardless of whether it 
is initially sensed by color contrast, luminance contrast, or 
motion contrast (Zhang et al., 1995). 

The visual system must be able to cope with the large 
changes in ambient light level that take place due to time of
day, presence or absence of clouds, and moving in and out 
of the shade. Even with fixed lighting conditions, some parts 
of the visual scene may be brightly lit while others may be
in the shade. The image projected onto a receptor array is
the product of the illumination falling upon the objects
within the visual scene, multiplied by the reflectivities of
these objects. Since it is the reflectivity (both overall mag- 
nitude and spectrum) that provides the useful information 
about object identity, the visual system needs a method to 
minimize the effects of varying illumination. 

These illumination problems must be dealt with in the 
first stages of processing, before object formation can take 
place. The large changes in ambient light level appear to be 
handled at the receptor level through adaptation. Adaptation 
is a process whereby the sensitivity of the photoreceptor 
depends on the time-averaged light level. In biological 
photoreceptors, biochemical processes provide the needed 
automatic gain control. The outputs of small groups of
photoreceptors are then combined so as to enhance the 
differences in reflectivity of objects within the scene by 
using a “center-surround” organization (Fig. 2, column I).
This is done by combining an excitatory input from a 
receptor or small cluster of receptors with inhibitory inputs 
from the surrounding neighbors (on-center receptive field) 
or by combining an inhibitory input from a receptor or 
cluster with excitatory inputs from the surrounding neigh- 
bors (off-center receptive field). Mathematically, the com- 
bination of adaptation and center-surround organization is 
equivalent to performing the combination of local normal- 
ization and a two-dimensional second spatial derivative on 
the output of the receptor array. This process has the effect 
of emphasizing contrast boundaries in the image. The spatial
extent of the receptors contributing to the receptive field
can be varied at the design stage to achieve different degrees 
of resolution (image smoothing). Alternatively, the scene 
can be processed by parallel pathways each with a different 
resolution. If appropriate weights are used for the excitation 
and inhibition, then the center-surround spatial filters can be
approximated mathematically as Gabor functions (Weldon 
and Higgins, 1999). The multi-resolution approach can be 
thought of as taking a two-dimensional wavelet transform of
the image (Porat and Zeevi, 1989). 
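
As a minimal sketch of the normalization-plus-second-derivative idea, the Python fragment below divides each pixel by its local mean (a crude stand-in for adaptation) and then subtracts the mean of the four nearest neighbours from each pixel (a discrete, sign-inverted Laplacian acting as an on-center unit). The 3 x 3 neighbourhood, the function names, and the test image are illustrative assumptions, not the weights or circuits used in the cited work.

```python
def local_mean(img, r, c):
    """Mean of the 3x3 neighbourhood around (r, c), clipped at the borders."""
    rows, cols = len(img), len(img[0])
    vals = [img[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))]
    return sum(vals) / len(vals)


def adapt(img):
    """Assumed adaptation stage: divide each pixel by its local mean."""
    return [[img[r][c] / local_mean(img, r, c) for c in range(len(img[0]))]
            for r in range(len(img))]


def on_center(img):
    """Centre minus the mean of the four nearest neighbours; borders stay 0."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            surround = (img[r - 1][c] + img[r + 1][c]
                        + img[r][c - 1] + img[r][c + 1]) / 4.0
            out[r][c] = img[r][c] - surround
    return out


def step_image(left, right, size=8):
    """Two surfaces of different reflectivity meeting at a vertical edge."""
    return [[left if c < size // 2 else right for c in range(size)]
            for _ in range(size)]


# The same reflectivity pattern under dim and bright illumination.
dim = on_center(adapt(step_image(1.0, 2.0)))
bright = on_center(adapt(step_image(100.0, 200.0)))
```

Because the assumed adaptation stage divides out the overall light level, `dim` and `bright` agree to within rounding error, and the response is nonzero only in the columns near the boundary between the two surfaces.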

Distance information is not available to the visual system 
directly, because the external three-dimensional world is 
mapped onto a two-dimensional array of receptors. If a 
three-dimensional internal representation is needed, say for 
navigational purposes, then the third dimension must be 
synthesized from the information available from the recep- 
tors. If the system has two eyes with overlapping visual 
fields, then differences due to parallax between the images 
from the two eyes can be exploited (binocular disparity) to 
estimate distance; otherwise, vergence or more subtle cues 
must be used. To estimate binocular disparity, the visual
system appears to perform a spatial cross-correlation be- 
tween corresponding regions of the two retinas (Marr, 
1982). 
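
The disparity computation can be illustrated in one dimension. The sketch below slides a patch from one eye across the corresponding patch from the other eye and reports the shift with the largest cross-correlation. The function names and the tiny test pattern are assumptions made for illustration; a practical system would repeat this for each image region and would probably normalize the correlation so that bright regions do not dominate.

```python
def cross_correlation(left, right, shift):
    """Sum of left[i] * right[i - shift] over the overlapping samples."""
    total = 0.0
    for i in range(len(left)):
        j = i - shift
        if 0 <= j < len(right):
            total += left[i] * right[j]
    return total


def estimate_disparity(left, right, max_shift):
    """Shift (in receptor spacings) that best aligns the two patches."""
    shifts = range(-max_shift, max_shift + 1)
    return max(shifts, key=lambda s: cross_correlation(left, right, s))


# The same feature seen two receptors further along in the left eye.
right_patch = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
left_patch = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
disparity = estimate_disparity(left_patch, right_patch, max_shift=4)
```

A larger disparity corresponds to a nearer object; converting the shift into a distance requires the eye separation and optics, which are outside this sketch.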

Figure 2. Orientation processing consists of combining the outputs of cells with center-surround organization
(column I) to create oriented receptive fields (column II). These oriented receptive fields are then combined
to form oriented edge detectors (column III).

Spatial cross-correlation is also used to detect motion.
Coincidence detection between the output of a cell and the
delayed outputs of other cells with nearby receptive fields is
mathematically equivalent to computing the spatial cross-correlation
between the current visual frame and a previous
visual frame on a region-by-region basis.
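
A minimal sketch of delay-and-coincidence motion detection follows. Each cell's delayed output is multiplied by the current output of its right-hand neighbour (signalling rightward motion) and of its left-hand neighbour (signalling leftward motion), and the two sums are subtracted. The one-dimensional layout, the one-frame delay, and the function name are assumptions chosen to keep the example short.

```python
def motion_signal(frames, delay=1):
    """Net rightward-minus-leftward coincidence between delayed and current frames.

    frames: list of equal-length lists, one list of cell outputs per time step.
    """
    rightward = leftward = 0.0
    for t in range(delay, len(frames)):
        now, before = frames[t], frames[t - delay]
        for x in range(len(now) - 1):
            # Delayed output of cell x coincides with its right neighbour now.
            rightward += before[x] * now[x + 1]
            # Delayed output of cell x + 1 coincides with its left neighbour now.
            leftward += before[x + 1] * now[x]
    return rightward - leftward


# A single bright spot stepping one cell to the right per frame.
moving_right = [[1.0 if x == t else 0.0 for x in range(6)] for t in range(5)]
```

Summing `before[x] * now[x + 1]` over `x` is the cross-correlation of the previous and current frames at a shift of one cell, which is why the sign of the result follows the direction of motion.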

Orientation processing involves detecting lines and edges 
and estimating their angular orientation. Hubel and Wiesel 
(1962), working with cat visual cortex, showed that detec- 
tion of oriented edges can be accomplished by a sequence of 
processing stages that combine the outputs of groups of 
cells with similar center-surround characteristics. By using 
groups of cells arranged as short linear arrays, short linear 
segments of light or dark can be detected (Fig. 2, column II).
Different arrays have different orientations (orientation tun- 
ing), so that all possible edge segments within a region can 
be detected. If we then combine the output of pairs of these 
arrays that are slightly offset from each other and have the 
same orientation but with one array being of the “on” type
and the other being of the “off” type, we have a system that 
detects edge segments between areas of different reflectivities
(Fig. 2, column III). This process can be performed a
second time to detect line segments. Higher-level process- 
ing can then be used to group the edge or line segments into 
longer lines and arcs (Pasupathy and Connor, 1999). 

We have implemented this type of processing in silicon 
by designing a set of integrated circuits that implement the 
processing illustrated in Figure 2 (Hinck and Hubbard, 
1999). We do not have space here to go into the details of 
the silicon implementation, but one significant difference 
between the biological and silicon systems must be mentioned.
In biological systems, the information between processing
units (cells) is carried by axons that are self-routing;
in other words, they can work their way through the nervous 
tissue and find their targets. With silicon processing sys- 
tems, the wiring problem becomes serious. The processing 
described within a single column of Figure 2 only requires 
communication between nearby elements on the chip. How- 
ever, when we need to move information from one process- 
ing level or chip to another (from one column to another in 
Fig. 2), then we run into problems due to the sheer number 
of wires involved. To reduce this bottleneck, a technique 
known as address event representation (AER) is used (Boahen,
2000). When a silicon cell is “excited,” it broadcasts its
address (identity) to all listeners, which may be a one-to-one
or a one-to-many mapping. Each broadcast event is equivalent
to the production of a single action potential (spike) in
the biological system, and given the bandwidth (speed) of 
the circuitry we have the ability to transmit the identity of 
all the spikes from all the cells on a chip. Because the 
processing is taking place in real time, there is no need to
record a time stamp for the events. For simulations that do 
not run in real time, each event may need both a time stamp 
and an address. 
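
The addressing scheme can be sketched in software. In the fragment below, which models the non-real-time case, each spike becomes a (time step, address) pair, silent cells emit nothing, and a lookup table delivers each event to one or many listeners. The function names and the `fanout` table are hypothetical and stand in for the on-chip arbitration and routing hardware, which is not modeled.

```python
from collections import defaultdict


def encode_events(spike_frames):
    """Convert per-time-step spike vectors into (time_step, address) events."""
    return [(t, addr)
            for t, frame in enumerate(spike_frames)
            for addr, spiked in enumerate(frame) if spiked]


def route(events, fanout):
    """Deliver each event to its listeners.

    fanout maps a sender address to a list of listener names, so the same
    table expresses one-to-one and one-to-many connections.
    """
    inbox = defaultdict(list)
    for t, addr in events:
        for listener in fanout.get(addr, []):
            inbox[listener].append((t, addr))
    return dict(inbox)


# Four cells over three time steps; only two spikes occur.
spikes = [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
events = encode_events(spikes)
delivered = route(events, {1: ["a", "b"], 3: ["b"]})
```

Only two events are transmitted for twelve cell-time slots, which is the sparsity that keeps the communication load low.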

With AER, signaling takes place only if a spike is gen- 
erated: this minimizes power consumption because, for a 
single cell, spikes are relatively rare events. This minimi- 
zation of power consumption is important, especially for 
small robots (as well as for biological systems), since low 
power consumption allows operation for longer periods of 
time without replenishment of energy stores.


Auditory Scene Analysis


A major challenge in auditory scene analysis is that 
acoustic signals from different sources can overlap in direction,
frequency, and time. We believe that biological systems
meet this challenge by dividing up the received signals
in frequency and time and then, through the use of appropriate
grouping principles, enhancing the signal-to-noise ratio for
individual sources to the point where the bearing and identification
of the source can be determined. In many applications,
both transient and long-duration signals are of interest.
In auditory scene analysis, each frequency band can
be analyzed for the presence of specific features, and then
the grouping rules can be used to combine information from
selected frequency bands to produce the feature vector that
represents an auditory object.

Audition, unlike vision, has no method by which even 
two of the three physical dimensions of the external acoustic 
world can be projected directly onto the receptor array. To 
determine the direction of a sound source, one either needs 
to compare signals acquired by directional ears (micro- 
phones) with different orientations or compare measure- 
ments of pressure taken at different locations in space. In the 
latter case, the ears or microphones must be spaced sufficiently
far apart that the time delay due to the speed of sound is large
enough to be sensed or measured. If only two ears or 
microphones are used, then directional ambiguities are 
present, but these can generally be resolved through rotation 
of the head or microphone array. The third dimension 
(source distance) is much more difficult to estimate in 
audition. Experiments with human listeners suggest that the 
ratio of direct to reverberant sound energy may be an 
important distance cue. How this ratio might be estimated is
not clear. 

Each frequency channel is analyzed in parallel through 
the computation of multiple features (Fig. 3). These features 
are likely to be similar for frequency channels that contain 
signals from the same sound source and are likely to differ 
for signals from different sound sources. For example, the 
differences in time delay between the arrival of the signals 
(interaural time differences, ITD) as well as differences in
intensity (interaural intensity differences, IID) at two sen- 
sors will be similar across frequency channels for a single 
source because these features depend on source direction. 
Frequency components with similar onsets, offsets, dura- 
tion, and envelope period are also most likely to be from a 
single sound source. 

For many vertebrates, the head size is sufficient to create 
significant time delays (ITD) between the ears that can be 
used for localization; at higher frequencies the head shadow 
effect is large, producing a significant IID. For very small 
animals, especially insects, the ears are very close together, 
making ITD estimation via neural circuits impractical, and 
the animal’s size precludes creating a sound shadow. These
animals appear to use mechanical or acoustic means, or
both, to detect the subtle pressure differences between the
two sides of their body (Michelsen, 1998).

Figure 3. Auditory feature analysis consists of first filtering and transducing
the sound received by the peripheral organs. This is followed by
parallel pathways of feature extraction, the outputs of which are then
processed to group related elements to form auditory objects.

As was the case for visual processing, the final step 
before auditory source identification is the grouping process 
(Bregman, 1990). In each of the feature maps described
above, timing information is preserved. This enables the 
grouping process to use common bearing, as determined by 
the ITD and IID maps, and synchrony across maps as the
major cues for grouping specific components together. This 
grouping process results in a simplified set of features that 
includes target direction, the major peaks in the target signal 
spectrum, and temporal features such as the period of the 
signal envelope. This set of features can then be compared 
to stored signatures to complete the identification process. 
Signatures in this context can be hardwired (acquired 
through evolution at the species level), learned through 
experience at the individual level, or derived from a com- 
bination of the two methods. 

If the system is hardwired, then it is possible to implement
the entire analysis/tracking system with simple circuits.
For example, the Webb and Scutt (2000) model of
cricket phonotaxis implements pattern recognition and 
source localization with a system comprising two receptors 
followed by four neurons. The pattern of interest in this case 
is the mating call of the male, which is characterized by a
limited range of carrier frequencies and a limited range of 
syllable repetition intervals (SRI) (modulation periods). Filtering
for the appropriate carrier frequencies takes place in
the hearing organ, and subsequent filtering for SRI takes 
place using a pair (one for each ear) of output neurons that 
act as lowpass filters, followed by another pair of neurons 
that act as highpass filters. Source localization is accomplished
by using directional ears and a combination of
excitation and inhibition in the same neurons that perform 
the highpass filtering. 
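
The effect of a lowpass stage followed by a highpass stage on the syllable repetition interval can be sketched with linear filters. The fragment below is not the Webb and Scutt model, which uses spiking neurons; it substitutes first-order discrete-time filters, and the smoothing constants (0.2 and 0.02), the sinusoidal envelope, and the function names are all assumptions chosen only to show the resulting band-pass behaviour.

```python
import math


def lowpass(signal, alpha):
    """First-order low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def highpass(signal, alpha):
    """First-order high-pass, formed as the input minus a slow low-pass."""
    return [x - s for x, s in zip(signal, lowpass(signal, alpha))]


def sri_response(period, n=4000):
    """Steady-state output swing for a call envelope with the given period.

    period is the syllable repetition interval in samples.
    """
    envelope = [0.5 * (1.0 + math.sin(2.0 * math.pi * t / period))
                for t in range(n)]
    out = highpass(lowpass(envelope, 0.2), 0.02)
    tail = out[-1600:]  # skip the start-up transient
    return (max(tail) - min(tail)) / 2.0
```

With these assumed constants, `sri_response(40)` is larger than both `sri_response(4)` and `sri_response(800)`: repetition intervals that are too short are removed by the lowpass stage and intervals that are too long by the highpass stage.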

For auditory scene analysis, it is essential that the filters 
that perform the frequency separation be designed to have 
impulse responses that are compact both in frequency and 
time. The performance measure commonly used to describe 
this feature is the time-bandwidth product. Simple, single 
mode resonances, although narrow in frequency, do not 
have good temporal performance and hence do not have 
good time-bandwidth products. The impulse response that 
achieves the theoretical time-bandwidth product limit is a
sinusoid with a Gaussian envelope (Gabor function). Such 
an impulse response is physically unrealizable, but it is 
possible to combine multiple resonances to create a re- 
sponse that comes close to the ideal. Also, for a general 
purpose signal processing system, it is generally better to 
use filters with a constant ratio of bandwidth to center 
frequency (constant Q) rather than a constant bandwidth 
like that obtained with a Fourier transform. The widespread 
use of approximately constant-Q filtering across the ears of 
many species ranging from bush crickets (Hoy, 1992) to 
mammals (Javel, 1986) suggests that this approach offers 
significant survival value. The use of a constant-Q filter 
bank is very similar mathematically to taking a wavelet 
transform of the acoustic time signal. It should be noted that 
most of the acoustic frequencies of biological significance 
are higher than what most cells can follow, so the filtering 
is generally done mechanically before detection by the 
receptor cells. The number of frequency channels may vary 
from very few in insects (Michelsen, 1992) to hundreds in 
many vertebrates (Echteler et al., 1994). 
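
The difference between constant-Q and constant-bandwidth analysis is easy to see by listing the channels. The sketch below generates geometrically spaced center frequencies and assigns each a bandwidth equal to the center frequency divided by Q. The function name, the spacing parameter, and the example values are illustrative assumptions rather than measurements from any species.

```python
def constant_q_bank(f_low, f_high, channels_per_octave, q):
    """Return (center_frequency, bandwidth) pairs with bandwidth = center / q."""
    bank = []
    step = 2.0 ** (1.0 / channels_per_octave)
    f = f_low
    # The small tolerance keeps f_high itself despite rounding in f *= step.
    while f <= f_high * (1.0 + 1e-9):
        bank.append((f, f / q))
        f *= step
    return bank


# One channel per octave from 100 Hz to 1600 Hz with Q = 4.
bank = constant_q_bank(100.0, 1600.0, 1, 4.0)
```

In this bank the bandwidth grows from 25 Hz at 100 Hz to 400 Hz at 1600 Hz, so the high-frequency channels trade frequency resolution for temporal resolution, whereas a Fourier transform would give every channel the same bandwidth.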

Typically this filtering process is implemented in silicon 
using a cascade of second-order filters with progressively 
lower resonant frequencies. This cascade is intended to 
simulate the traveling wave of the mammalian cochlea,
which starts in the basal (high-frequency) end of the cochlea 
and propagates towards the apical (low-frequency) end. For 
this purpose, subthreshold circuits have been most com- 
monly used (Mead, 1989; Fragniere et al., 1997; Sarpeshkar 
et al., 1998). 
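
A software analogue of such a cascade is sketched below. Each stage is a second-order digital low-pass section, the output of one stage feeds the next, and every stage output is kept as a tap. The coefficient formulas are the widely used bilinear-transform low-pass biquad; the sample rate, the Q of 0.8, the four resonant frequencies, and the function names are assumptions for illustration and are not taken from the cited circuits.

```python
import math


def lowpass_coeffs(f0, q, fs):
    """Normalized biquad low-pass coefficients (b0, b1, b2), (a1, a2)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    cosw = math.cos(w0)
    a0 = 1.0 + alpha
    b = ((1.0 - cosw) / 2.0 / a0, (1.0 - cosw) / a0, (1.0 - cosw) / 2.0 / a0)
    a = (-2.0 * cosw / a0, (1.0 - alpha) / a0)
    return b, a


def cochlear_cascade(signal, resonant_freqs, q=0.8, fs=16000.0):
    """Pass the signal through the stages in order; return one trace per tap."""
    taps, x = [], list(signal)
    for f0 in resonant_freqs:
        b, a = lowpass_coeffs(f0, q, fs)
        y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
        for xn in x:
            yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
            x2, x1 = x1, xn
            y2, y1 = y1, yn
            y.append(yn)
        taps.append(y)
        x = y  # the next (lower-frequency) stage sees this stage's output
    return taps


def tone(freq, n=2000, fs=16000.0):
    """Unit-amplitude sine wave used as a test input."""
    return [math.sin(2.0 * math.pi * freq * t / fs) for t in range(n)]


def tail_amplitude(trace, n=400):
    """Peak absolute value over the last n samples (after the transient)."""
    return max(abs(v) for v in trace[-n:])
```

Feeding `tone(3000.0)` and `tone(200.0)` through `cochlear_cascade(..., [4000.0, 2000.0, 1000.0, 500.0])` shows the expected pattern: the high-frequency tone is present at the first tap but has nearly vanished by the last, while the low-frequency tone reaches the last tap almost unchanged.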


Like the visual system, the auditory system must also deal 
with a wide range of signal levels. Here again, adaptation 
(automatic gain control) plays an important role. In mam- 
malian auditory systems the adaptation is specific to each 
frequency channel (Javel, 1986). In insects, responses of 
neurons in the central nervous system can also exhibit 
adaptation (e.g., see Lewis, 1992). 

Unlike the visual system, however, the auditory system is 
processing a very rapidly changing signal, one that often 
changes much faster than the biological hardware can fol- 
low. To circumvent the problem of following high-fre- 
quency signals, the receptor cells (hair cells) act as soft 
half-wave rectifiers (Mountain and Hubbard, 1996) so that 
at high frequencies they respond to the envelope of the 
acoustic signal rather than to the fine structure of the signal. 
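
The envelope-following behaviour can be sketched as a soft rectifier followed by smoothing. In the fragment below the softplus function serves as an assumed form of soft half-wave rectification, and a first-order low-pass stands in for the limited speed of the receptor cell; both choices, along with the parameter values and names, are illustrative assumptions rather than a fitted hair-cell model.

```python
import math


def soft_half_wave(x, sharpness=10.0):
    """Softplus: a smooth approximation to max(x, 0)."""
    return math.log1p(math.exp(sharpness * x)) / sharpness


def receptor_output(signal, smoothing=0.05):
    """Rectify each sample, then smooth with a first-order low-pass."""
    out, y = [], 0.0
    for x in signal:
        y += smoothing * (soft_half_wave(x) - y)
        out.append(y)
    return out


def am_tone(n=800, carrier_period=8):
    """Fast carrier whose amplitude drops from 1.0 to 0.1 halfway through."""
    return [(1.0 if t < n // 2 else 0.1)
            * math.sin(2.0 * math.pi * t / carrier_period)
            for t in range(n)]


response = receptor_output(am_tone())
```

The carrier completes a cycle every eight samples, too fast for the smoothed output to follow, yet `response` settles at a clearly higher level during the loud half of the signal than during the quiet half. Without the rectifier, smoothing alone would average the carrier to nearly zero in both halves.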

In the auditory system, temporal cross-correlation and 
autocorrelation-like processing is believed to play an important
role (Colburn, 1996; Lyon and Shamma, 1996). In
vertebrates, the time delay between the two ears (ITD) is an
important cue for localization. The combination of neural 
delay lines and coincidence detection is used to cross- 
correlate the signals from the two ears for each frequency 
channel. Periodicity analysis is believed to take place also 
using delays and coincidence detection. Periodicity analysis 
no doubt plays an important role for many species from 
insects to man, because so many communication sounds 
involve periodic amplitude modulation (AM). Figure 4 illustrates
time waveforms in which AM is a prominent
feature for a cricket call (panel A) and for a human vowel 
(panel C). Panels B and D show the results of spectral 
analysis using a constant-Q filter bank, and except for center 
frequency and modulation rate, the AM signals are remark- 
ably similar. 
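
Both uses of delays and coincidence detection can be sketched briefly. In the fragment below, `estimate_itd` delays the spike times from one ear by each candidate amount and counts coincidences with the other ear, and `estimate_period` finds the lag at which a mean-subtracted envelope best matches a delayed copy of itself. Integer time steps, the function names, and the test signals are assumptions made for illustration.

```python
def coincidences(left_spikes, right_spikes, delay):
    """Number of left spikes that, once delayed, coincide with a right spike."""
    right = set(right_spikes)
    return sum(1 for t in left_spikes if t + delay in right)


def estimate_itd(left_spikes, right_spikes, max_delay):
    """Delay (in time steps) giving the most coincidences between the ears."""
    delays = range(-max_delay, max_delay + 1)
    return max(delays,
               key=lambda d: coincidences(left_spikes, right_spikes, d))


def autocorrelation(x, lag):
    """Sum of x[i] * x[i - lag] over the samples where both exist."""
    return sum(x[i] * x[i - lag] for i in range(lag, len(x)))


def estimate_period(envelope, min_lag, max_lag):
    """Lag at which the mean-subtracted envelope best matches itself."""
    mean = sum(envelope) / len(envelope)
    centered = [v - mean for v in envelope]
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(centered, lag))


# Sound reaches the right ear three time steps after the left ear.
left = [5, 20, 37, 51]
right = [t + 3 for t in left]

# Envelope pulsing once every 10 samples.
envelope = [1.0 if n % 10 < 3 else 0.0 for n in range(200)]
```

In a full system each of these computations would be repeated in every frequency channel, so that the resulting ITD and period estimates can later be compared across channels for grouping.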


Olfactory Scene Analysis 


By analogy to the visual and auditory systems, we refer to 
the problem of identifying and localizing odor sources in
complex environments as olfactory scene analysis. Unlike 
vision and hearing, in which the signal propagates via wave 
phenomena, olfaction is characterized by mass transport by 
currents in water or air and the associated turbulence found 
in these media (Grasso, 2001). No direct information about 
source location is present in the received signal, but approx- 
imate direction can be estimated by sensing wind or water- 
flow direction. The only way a source can be located with 
any certainty is to trace the odor plume back to its source. 

Figure 4. Spectral analysis of animal communication sounds. Time
waveforms for a cricket call (panel A) and for a human vowel (panel C) are
plotted along with the results of spectral analysis using a constant-Q filter
bank (panels B and D).

In general, individual odor sources release mixtures of
compounds into the environment, and the signal at the
sensory organ is the result of the mixing of turbulent plumes
from multiple sources. Due to the nature of turbulent transport,
the plume produced by a single odor source is made up
of a series of patches or filaments distributed within the
plume; these move past the olfactory organ, creating a series
of odor pulses at the receptors with random arrival times,
durations, and amplitudes (Moore and Atema, 1991). The
patchy nature of odor concentration signals can be seen in
the two concentration signals shown in Figure 5. In a
multi-source environment, the odor pulses from one source
will be intermixed with pulses from other sources. In such
an environment, the average concentration of a compound is
not a useful feature for olfactory scene analysis. Even if
only one odor source is present, the statistical nature of the
plume is such that several minutes of signal averaging are
necessary to get an accurate estimate of average concentration.
However, behavioral experiments in plumes of this
sort indicate that animals make olfactory decisions on the
order of a few seconds (Basil and Atema, 1994).

Like the visual and auditory systems, the olfactory system
must be able to cope with wide ranges in signal (concentration)
level. Olfactory receptors, like their counterparts
in the other sensory systems, also exhibit adaptation that 
adjusts the sensitivity of individual receptors on the basis of 
background concentration levels. Olfactory systems have 
many different receptor types, ranging from a few dozen in 
insects to approximately 1000 receptor types in mammals. 
Some receptors, mainly those that have evolved to detect 
pheromones, are extremely selective, but most will respond 
to a number of different compounds. The higher the selec- 
tivity of a receptor, the higher the affinity for the odor 
molecule and the slower the release of the odor molecule 
after it has bound to the receptor (Lauffenburger and Lin- 
derman, 1993). The relationship between high affinity and 
slow release comes about because affinity depends on the 
ratio of the binding to unbinding rates. The rate of binding 
is limited by the rate at which the odorant can access the 
binding site, a rate that is similar for all receptors. Affinity, 
therefore, varies from receptor to receptor, largely due to 
differences in the unbinding rate. This relationship means
that high molecular selectivity leads to poor temporal resolution,
not unlike the trade-off between frequency selectivity
and temporal resolution in auditory filters.

Figure 5. Simulation of how an array of olfactory receptor cells might
respond to the mingling of odor plumes from two different sources. The top
two panels show the concentration signals from the two sources, and the
bottom panel is the response from a simulation of 32 receptors that vary in
their sensitivity to the two odorants.
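
The affinity argument can be written compactly. As a sketch, assume simple one-step binding of an odorant O to a receptor R; the symbols k_on (binding rate constant), k_off (unbinding rate constant), K_A (affinity), and tau_release (mean time the odorant stays bound) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
  \mathrm{O} + \mathrm{R}
    &\underset{k_{\mathrm{off}}}{\overset{k_{\mathrm{on}}}{\rightleftharpoons}}
    \mathrm{OR}, \\
  K_A &= \frac{k_{\mathrm{on}}}{k_{\mathrm{off}}}, \\
  \tau_{\mathrm{release}} &= \frac{1}{k_{\mathrm{off}}}
    = \frac{K_A}{k_{\mathrm{on}}}.
\end{align}
```

If k_on is roughly the same for all receptors, as argued above, the bound lifetime grows in proportion to the affinity, so the receptors with the highest affinity are also the slowest to clear after an odor pulse has passed.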

Olfactory receptors have been shown to respond rapidly 
enough that the temporal characteristics of the concentration 
signals could be available to the central nervous system 
(Gomez and Atema, 1996). Since most odors are mixtures 
and a single olfactory receptor cell can be stimulated by 
more than one compound, the odor from a single source will 
excite a number of different receptor cells, with the pattern
of excitation varying from one odor mixture to another. In 
Figure 5 we simulate how an array of olfactory receptor 
cells might respond to the mingling of odor plumes from 
two different sources. The top two panels show the concen- 
tration signals from the two sources, and the bottom panel is 
the response from a simulation of 32 receptors that vary in 
their sensitivity to the two odorants. One can see from 
Figure 5 that, as in the auditory system, grouping can be 
done using temporal cues. In other words, receptors whose 
activities co-vary in time are likely to be responding to the 
same odor source. 
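
Grouping by temporal co-variation can be sketched with a correlation measure. The fragment below computes the Pearson correlation between receptor time series and places each receptor in the first group whose founding member it tracks closely. The greedy grouping rule, the 0.8 threshold, the function names, and the toy data (in which each receptor responds to only one source) are simplifying assumptions; this is not the 32-receptor simulation of Figure 5.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def group_receptors(responses, threshold=0.8):
    """Greedy grouping of receptor indices by correlation with a group's founder."""
    groups = []
    for idx, trace in enumerate(responses):
        for group in groups:
            if pearson(responses[group[0]], trace) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])  # no existing group matched
    return groups


# Two intermittent sources whose odor pulses arrive at different times.
source_a = [0, 1, 0, 0, 2, 0, 0, 0, 1, 0]
source_b = [0, 0, 0, 3, 0, 0, 1, 0, 0, 2]
responses = [
    [2.0 * v for v in source_a],  # receptor 0: sensitive to source A
    [0.5 * v for v in source_b],  # receptor 1: weakly sensitive to source B
    [1.0 * v for v in source_a],  # receptor 2: sensitive to source A
    [3.0 * v for v in source_b],  # receptor 3: strongly sensitive to source B
]
groups = group_receptors(responses)
```

Receptors 0 and 2 end up together, as do receptors 1 and 3, even though their sensitivities differ, because correlation depends on the timing of the pulses and not on their size.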

Hardware models of olfactory scene analysis have not 
progressed very far due to the lack of sensors with the 
combination of appropriate chemical selectivity and fast 
temporal responses. Most current experiments are being 
done with surrogate odor sources for which fast sensors are 
available. The systems used in these experiments are gen- 
erally designed to locate the odor source and not to classify 
the odor type. Due to the difficulty of accurately simulating 
chemical plumes in software, artificial systems for olfactory 
scene analysis often involve the use of robots. For example, 
we have used an aquatic robot (RoboLobster) that uses 
conductivity sensors to locate sources of salt in a freshwater 
flume (Grasso et al., 2000). 


Summary and Conclusions


The comparisons of strategies for scene analysis across 
the three sensory modalities—visual, auditory, and olfac- 
tory—described above illustrate several common themes 
that operate across modalities. For example, we see the 
dissection of the sensory signal, its processing in parallel 
pathways to extract key features, and then the grouping of 
portions of the signal to form perceptual objects. In all three 
of these senses, adaptation plays an important role in the 
first stages of processing. Fundamental trade-offs such as 
spectral versus temporal resolution or molecular selectivity 
versus temporal resolution shape peripheral processing. The 
mathematical concept of cross-correlation and its neural 
counterpart, coincidence detection, show up over and over
again. We believe that by comparing strategies for sensory 
processing across sensory modalities as well as across many 
different species we can derive fundamental principles of 
sensory information processing that can be used to design
artificial systems capable of analyzing complex environments.